The present invention relates to an apparatus and method
for indexing sequences of sub-word units, such as 
sequences of phonemes or the like. The invention can be 
used to identify regions of a database for search in 
response to a user's input query. The input query may be 
a voiced or typed query. 

Databases of information are well known and suffer from 
the problem of how to locate and retrieve the desired 
information from the database quickly and efficiently. 
Existing database search tools allow the user to search 
the database using typed key words. Whilst this is quick 
and efficient, this type of searching is not suitable for 
various kinds of databases, such as video or audio 
databases.

A recent proposal has been made to annotate such video 
and audio databases with a phonetic transcription of the 
speech content of the audio and video files, with 
subsequent retrieval being achieved by comparing a 
phonetic transcription of the user's input query with the 
phoneme annotation data in the database. The technique
proposed for matching the sequences of phonemes firstly
defines a set of features in the query, each feature
being taken as an overlapping fixed size fragment from
the phoneme string. It then identifies the frequency of
occurrence of the features in both the query and the
annotation and then finally determines a measure of the
similarity between the query and the annotation using a
cosine measure of these frequencies of occurrences.
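
The fragment-frequency comparison described above can be sketched as follows. This is a minimal illustration in Python and not the implementation of the cited proposal: the function names, the fragment size of three and the choice to count fragments in both strings are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def fragments(phonemes, size=3):
    """Return the overlapping fixed-size fragments of a phoneme sequence."""
    return [tuple(phonemes[i:i + size]) for i in range(len(phonemes) - size + 1)]


def cosine_similarity(query, annotation, size=3):
    """Cosine of the angle between the two fragment-frequency vectors."""
    q = Counter(fragments(query, size))
    a = Counter(fragments(annotation, size))
    # Only fragments present in both strings contribute to the dot product.
    dot = sum(q[f] * a[f] for f in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in a.values()))
    return dot / norm if norm else 0.0
```

Because every annotation has to be scored against the query in this way, the cost grows with the size of the database, which is the motivation for the index described below.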

However, as those skilled in the art will appreciate, if 
the database is large, then this retrieval method becomes 
unfeasibly slow. An indexing method is therefore
required. 

As is well known, indexing provides a one-to-many mapping 
between the index (sometimes referred to as the key) and 
the data in the database. Where the database comprises 
words, this indexing is simple, but for phonemes this 
raises a number of difficulties. Firstly, because there
are only a small number of phonemes (approximately 43 in
the English language), a naive mapping using a
single phoneme as a key is not sufficiently
discriminating, since any given phoneme will occur
several thousand times in the database. Secondly,
because of the relatively poor recognition rate of
phonemes (60% to 70%), any one-to-many mapping
will make it difficult to retrieve data where the query
phoneme or annotation phoneme was misrecognised, inserted
or omitted. Finally, performing any statistical
retrieval method becomes computationally unfeasible.

The present invention aims to provide an efficient sub- 
word indexing technique which can be used in a retrieval 
system to identify areas of a database for searching.

Exemplary embodiments of the present invention will now 
be described with reference to the accompanying drawings, 
in which: 

Figure 1 is a schematic block diagram illustrating a user
terminal which allows the user to retrieve information 
from an input typed or voice query; 

Figure 2 is a schematic diagram of phoneme and word 
lattice annotation data which is generated from a voiced 
input by the user for annotating a document; 

Figure 3a diagrammatically illustrates the block nature 
of an annotation stored in the annotation database which 
forms part of the user terminal shown in Figure 1; 



Figure 3b is a schematic diagram illustrating a sequence 
of annotation phonemes which is included in one of the 
blocks of the annotation shown in Figure 3a; 

Figure 3c schematically illustrates a sequence of phoneme 
clusters for the phoneme sequence shown in Figure 3b and 
illustrates how these phoneme clusters can be grouped to 
form a number of overlapping phoneme cluster N-grams; 

Figure 4 is a flowchart illustrating the main processing 
steps involved in creating a phoneme index; 

Figure 5 illustrates an example of a phoneme index which 
is generated during the processing of the steps shown in 
Figure 4;

Figure 6 is a flowchart illustrating the processing steps 
involved in performing a phoneme search of an annotation 
database; 

Figure 7a schematically illustrates a sequence of 
phonemes representing an input query; 

Figure 7b schematically illustrates the way in which the 
sequence of phonemes shown in Figure 7a can be divided
into a number of overlapping phoneme N-grams;

Figure 7c illustrates a number of overlapping phoneme 
cluster N-grams derived from the phoneme N-grams shown in
Figure 7b; 

Figure 8 is a flowchart illustrating the main processing 
steps involved in using the phoneme index to identify 
locations of the annotation for phoneme matching; 

Figure 9a is a flowchart illustrating part of the process 
steps involved in determining the different phoneme 
clusters; 



Figure 9b is a flowchart illustrating the remaining
process steps involved in determining the different
phoneme clusters;

Figure 10 is a schematic block diagram illustrating the
form of an alternative user terminal which is operable to
retrieve a data file from a database located within a
remote server in response to an input voice query; and

Figure 11 illustrates another user terminal which allows
a user to retrieve data from a database located within a
remote server in response to an input voice query.

Embodiments of the present invention can be implemented 
using dedicated hardware circuits, but the embodiment to 
be described is implemented in computer software or code, 
which is run in conjunction with processing hardware such 
as a personal computer, work station, photocopier, 
facsimile machine, personal digital assistant (PDA), web 
browser or the like. 

DATA FILE RETRIEVAL 

Figure 1 is a block diagram illustrating the form of a
user terminal 59 which is used, in this embodiment, to
retrieve documents from a document database 29 in
response to a voice or typed query input by the user 39.
The "documents" may be text documents, audio files, video
files, photographs, mixtures of these etc. The user
terminal 59 may be, for example, a personal computer, a
hand-held device or the like. As shown, the user
terminal 59 comprises the document database 29, an
annotation database 31 comprising a descriptive
annotation of each of the documents in the document
database 29, a phoneme matcher 33, a phoneme index 35, a
word matcher 37, a word index 38, a combiner unit 40, an
automatic speech recognition unit 51, a phonetic
transcription unit 75, a keyboard 3, a microphone 7 and
a display 57.

In operation, the user inputs either a voice query via 
the microphone 7 or a typed query via the keyboard 3 and 
the query is processed either by the automatic speech 
recognition unit 51 or the phonetic transcription unit 75 
to generate corresponding phoneme and word data. The 
phoneme data is input to the phoneme matcher 33 which is 
operable to perform a phoneme search in the annotation 
database 31 with reference to a phoneme index 35. 
Similarly, the word data is input to the word matcher 37
which is operable to search the annotation database 31
with reference to the word index 38. The results of the
phoneme and word search of the annotation database are
then input to the combiner unit 40 which uses these
results to retrieve a ranked list of documents 41 from
the document database 29 which are output to the display
unit 57 for display to the user 39.

In this embodiment, the annotation data for each document 
comprises a combined phoneme (or phoneme-like) and word 
lattice. Figure 2 illustrates the form of the phoneme 
and word lattice annotation data generated for the spoken 
annotation "picture of the Taj Mahal". As shown, the
phoneme and word lattice is an acyclic directed graph
with a single entry point and a single exit point. It 
represents different parses of the user's input. It is 
not simply a sequence of words with alternatives, since 
each word does not have to be replaced by a single 
alternative, one word can be substituted for two or more 
words or phonemes, and the whole structure can form a 
substitution for one or more words or phonemes. 
Therefore, the density of the data within the phoneme and 
word lattice annotation data essentially remains linear 
throughout the annotation data, rather than growing 
exponentially as in the case of a system which generates 
the N-best word lists for the annotation input. 

In this embodiment, the annotation data for each document 
(d) is stored in the annotation database 31 and has the 
following general form: 

HEADER
   - flag if word if phoneme if mixed
   - time index associating the location of blocks of
     annotation data within memory to a given time point
   - word set used (i.e. the dictionary)
   - phoneme set used
   - the language to which the vocabulary pertains
   - phoneme probability data

Block(i)  i = 0, 1, 2, ...
   node n_j  j = 0, 1, 2, ...
      - time offset of node from start of block
      - phoneme links (k)  k = 0, 1, 2, ...
           offset to node n_j = n_k - n_j (n_k is node to
           which link k extends) or, if n_k is in
           block(i+1), offset to node n_j = n_k + N_b - n_j
           (where N_b is the number of nodes in block(i))
           phoneme associated with link (k)
      - word links (l)  l = 0, 1, 2, ...
           offset to node n_j = n_l - n_j (n_l is node to
           which link l extends) or, if n_l is in
           block(i+1), offset to node n_j = n_l + N_b - n_j
           (where N_b is the number of nodes in block(i))
           word associated with link (l)

The flag identifying if the annotation data is word 
annotation data, phoneme annotation data or if it is 
mixed is provided since the annotation data may include 
just word data, just phoneme data or both word and
phoneme data.

In this embodiment the annotation data is divided into 
blocks (B) of nodes (n) in order to allow the search to 
jump into the middle of the annotation data. The header 
therefore includes a time index which associates the 
location of the blocks of annotation data within the 
memory to a given time offset between the time of start 
and the time corresponding to the beginning of the block. 



The header also includes data defining the word set used
(i.e. the dictionary), the phoneme set used and their
probabilities and the language to which the vocabulary
pertains. The header may also include details of the
automatic speech recognition system or the phonetic
transcription system used to generate the annotation data
and any appropriate settings thereof which were used
during the generation of the annotation data.

The blocks of annotation data then follow the header and
identify, for each node in the block, the time offset of
the node from the start of the block, the phoneme links
which connect that node to other nodes by phonemes and
word links which connect that node to other nodes by
words. Each phoneme link and word link identifies the
phoneme or word which is associated with the link. They
also identify the offset to the current node. For
example, if node n50 is linked to node n55 by a phoneme
link, then the offset to node n50 is 5. As those skilled
in the art will appreciate, using an offset indication
like this allows the division of the continuous
annotation data into separate blocks.
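
The block, node and link layout described above might be modelled as in the following Python sketch. The class names, field names and the `resolve` helper are hypothetical and are introduced only to show how a relative offset can be followed across a block boundary; the header fields and any confidence scores are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    offset: int   # number of nodes from the source node to the target node
    label: str    # phoneme or word associated with the link


@dataclass
class Node:
    time_offset: float                               # time from the start of the block
    phoneme_links: List[Link] = field(default_factory=list)
    word_links: List[Link] = field(default_factory=list)


@dataclass
class Block:
    nodes: List[Node] = field(default_factory=list)


def resolve(blocks, block_idx, node_idx, offset):
    """Follow an offset from node `node_idx` of block `block_idx`.

    When the target lies beyond the end of the block, the number of nodes
    in that block (N_b in the listing above) is subtracted and the search
    moves on to the next block.
    """
    target = node_idx + offset
    while target >= len(blocks[block_idx].nodes):
        target -= len(blocks[block_idx].nodes)
        block_idx += 1
    return block_idx, target
```

Because each link stores only a relative offset, a block can be loaded and searched without first resolving absolute node addresses for the whole annotation.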

In an embodiment where an automatic speech recognition
unit outputs weightings indicative of the confidence of
the speech recognition unit's output, these weightings or
confidence scores would also be included within the data
structure. In particular, a confidence score would be
provided for each node which is indicative of the
confidence of arriving at the node and each of the
phoneme and word links would include a transition score
depending upon the weighting given to the corresponding
phoneme or word. These weightings would then be used to
control the search and retrieval of the data files by
discarding those matches which have a low confidence
score.

In order to provide an efficient retrieval method, a word
indexing scheme and a phoneme indexing scheme are used in
order to identify portions in the annotation database 31
against which a direct comparison with the input query is
made. The word index 38 and the way that it is used is
well known to those skilled in the art and will not be
described further. However, the way in which the phoneme
index 35 is generated and subsequently used to identify
portions of the annotation database 31 for comparison
with the input query will now be described in more
detail.

As mentioned above, the use of a single phoneme as the 
key for a phoneme index will not provide sufficient 
discrimination, since each phoneme will occur several 
thousand times in the annotation database 31. Further, 
since current automatic speech recognition systems have 
a relatively poor phoneme recognition rate (60 to 70%), 
indexing using the phonemes directly will make it 
difficult to retrieve the data where the query phoneme or 
the annotation phoneme was misrecognised. Since the 
automatic speech recognition system tends to produce 
decoding errors for similar sounding phonemes, such as 
/s/ and /z/ and not highly dissimilar phonemes, such as 
/z/ and /g/, the error rate of indexing can be greatly 
reduced by indexing on confusable clusters of phonemes 
rather than individual phonemes.

Considering, for example, the situation where a query and
an annotation exist for the word "sheep" and the query
(Q) comprises the sequence of phonemes /sh//iy//p/ and
the annotation (A) comprises the sequence of phonemes
/s//eh//p/. If the sequence of three query phonemes is
used as an index into the annotation (in the form of a
trigram), then it will not be possible to retrieve the
data since the sequence of query phonemes does not match
the sequence of annotation phonemes. However, if the
phonemes are clustered into confusable sets, and the
phonemes in the query and in the annotation are
classified into their respective sets, then there is a
better chance that there will be a match between the
query and the annotation if they both sound alike. For
example, if the following phoneme classifications are
defined:

C1 = {/s/ /z/ /sh/ /zh/}
C2 = {/t/ /k/ /g/ /b/ /p/}
C5 = {/eh/ /ih/ /iy/}

and the query phonemes and the annotation phonemes in the
above illustration are classified into these classes,
then this will result in the following cluster trigrams:

C(Q) = {C1 C5 C2}
C(A) = {C1 C5 C2}



Therefore, matching using the clustered query and
annotation will now work since C(Q) = C(A) and the data 
can be retrieved. 
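
The "sheep" illustration can be checked with a few lines of Python. The dictionary layout and the function name are assumptions made for this sketch; only the three clusters named in the example are included.

```python
# Cluster number -> member phonemes, taken from the three classes in the example.
CLUSTERS = {
    1: {"s", "z", "sh", "zh"},
    2: {"t", "k", "g", "b", "p"},
    5: {"eh", "ih", "iy"},
}

# Invert to a many-to-one mapping from phoneme to cluster number.
PHONEME_TO_CLUSTER = {p: c for c, members in CLUSTERS.items() for p in members}


def cluster_trigram(phonemes):
    """Replace each phoneme of a trigram by the number of its cluster."""
    return tuple(PHONEME_TO_CLUSTER[p] for p in phonemes)


query = ["sh", "iy", "p"]        # Q: /sh/ /iy/ /p/
annotation = ["s", "eh", "p"]    # A: /s/ /eh/ /p/

# The raw phoneme trigrams differ, but the cluster trigrams agree.
same_phonemes = query == annotation
same_clusters = cluster_trigram(query) == cluster_trigram(annotation)
```

Here `same_phonemes` is false while `same_clusters` is true, since both sequences map to the cluster trigram (1, 5, 2).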

In this embodiment, a hash indexing technique is used
which tries to create a unique mapping between a key and 
an entry in a list. In this embodiment trigrams of the 
above phoneme clusters are used as the key to the index. 
The way that the phoneme index 35 is created and used 
will now be described in more detail. 

In order to create the phoneme index 35 the phoneme 
annotation data stored in the annotation database 31 is 
converted into phoneme cluster trigrams. The way that 
this is achieved is illustrated in Figure 3. In
particular, Figure 3a schematically illustrates the block
form of the annotation data which is stored in the
database 31 for one of the documents (d) stored in the
document database 29. As shown, the annotation data
comprises successive blocks of data $B_0^d$ to $B_{M-1}^d$. As
mentioned above, the annotation data within each block
includes the nodes within the block and the phoneme and
word links which are associated with the nodes. In order
to illustrate the indexing method used in this
embodiment, the remaining description will assume that
the annotation data for document d includes a canonical
sequence of phonemes, i.e. one with no alternatives.
Figure 3b illustrates the canonical sequence of phonemes
within the i-th block for the annotation data for document
d. As shown, block $B_i^d$ comprises the canonical sequence
of phonemes $a_0^{di}$ to $a_{N_{di}}^{di}$ which extend between
nodes $n_0^{di}$ and $n_{N_{di}}^{di}$. Figure 3c illustrates the
overlapping "cluster
trigrams" 101 generated for the sequence of annotation
phonemes in block $B_i^d$ shown in Figure 3b. As shown in
Figure 3c, the cluster in which each of the annotation
phonemes in block $B_i^d$ belongs is determined. Then
cluster trigrams are determined from overlapping groups
of three cluster identifications for the sequence of
annotation phonemes. In particular, the first cluster
trigram determined is $C(a_0^{di})\,C(a_1^{di})\,C(a_2^{di})$, then
cluster trigram $C(a_1^{di})\,C(a_2^{di})\,C(a_3^{di})$ etc. Although
not shown in
Figure 3c, it is also necessary to consider the trigrams
which bridge adjacent blocks. For example, in this
embodiment, it would be necessary to consider the
following cluster trigrams:
$C(a_{N_{d(i-1)}-1}^{d(i-1)})\,C(a_{N_{d(i-1)}}^{d(i-1)})\,C(a_0^{di})$
and cluster trigram
$C(a_{N_{d(i-1)}}^{d(i-1)})\,C(a_0^{di})\,C(a_1^{di})$.
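
A minimal Python sketch of this conversion is given below. It assumes, for illustration only, that each block is represented simply as a list of canonical phonemes and that the phoneme at position r of a block starts at node r of that block; the function and parameter names are hypothetical. Flattening the blocks before windowing means the trigrams that bridge adjacent blocks are produced without special handling.

```python
def cluster_trigrams(blocks, phoneme_to_cluster):
    """Yield (cluster trigram, (block, node)) pairs for one annotation.

    `blocks` is a list of blocks, each a list of canonical phonemes.
    The pointer identifies the node of the first phoneme in the trigram.
    """
    # Flatten the blocks, remembering where each phoneme came from.
    flat = [(b, r, phoneme_to_cluster[p])
            for b, block in enumerate(blocks)
            for r, p in enumerate(block)]
    # Slide a window of three over the flattened sequence.
    for i in range(len(flat) - 2):
        key = tuple(cluster for _, _, cluster in flat[i:i + 3])
        block_idx, node_idx, _ = flat[i]
        yield key, (block_idx, node_idx)
```

For example, with two blocks `[["s", "eh"], ["p", "iy"]]` both trigrams produced span the block boundary, and each carries a pointer into the first block.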

To create the index, a large table or array, A, having S
entries is created. In this embodiment, each entry is
addressed by an index (IDX) which takes a value between
zero and S-1 and each entry includes a data field to
store the key (KEY) associated with the entry and a data
field to store the pointers which identify the relevant
locations within the annotation database where the
phoneme data associated with the KEY can be found.
Initially each of the entries is empty. The size of the
table depends on the number of different phoneme clusters
and the number of cluster identifications in the key. In
this case, three cluster identifications (trigrams) are
provided in each key. If there are ten clusters, then
the number of different possible keys is $10^3$. Therefore,
in this case, S should, theoretically, be made
approximately equal to $10^3$. However, in practice, some
of the possible keys are unlikely to occur. Therefore,
the size of the index can be set to have some initial
size and data can be added to the index until more than
a predetermined percentage of the entries are full. At
this point, a bigger table can be created and the data
from the old table copied over to the new table. This
process can be repeated until there is no more data to be
added to the index. Although this means that some memory
is wasted, this is insignificant compared to the memory
required for the pointers which will be stored in the
table.

The way that data is added to the table will now be
explained with reference to Figure 4. As shown, in step
s1, the system calculates the value of a function
(f(KEY)) which is dependent upon the key, i.e. a function
of the current cluster trigram, which value is used as
the index (IDX) into the table A. In particular, the
function f(KEY) defines a mapping between the cluster
trigram and an entry in the table A and always yields a
number between zero and S-1. In this embodiment, the
function used is:

$$f(KEY) = \left[ C[1]\,K_c^2 + C[2]\,K_c^1 + C[3]\,K_c^0 \right] \bmod S \qquad (1)$$

where $K_c$ is the number of phoneme clusters and C[1] is
the number of the cluster to which the first annotation
phoneme in the trigram belongs, C[2] is the number of the
cluster to which the second annotation phoneme in the
trigram belongs and C[3] is the number of the cluster to
which the third annotation phoneme belongs. For example,
for the illustration above where C(A) = {C1 C5 C2}, C[1]
= 1, C[2] = 5 and C[3] = 2.
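
As a worked illustration, the following Python sketch evaluates a hash of this kind for the "sheep" trigram. It assumes that the three cluster numbers are weighted by descending powers of the number of clusters (squared, first power, zeroth power); that weighting, the function name and the table size of 1009 are assumptions of the sketch rather than values confirmed by the text.

```python
def f_key(key, num_clusters, table_size):
    """Map a cluster trigram (C[1], C[2], C[3]) to a table index in [0, S-1].

    Assumed form: (C[1]*Kc**2 + C[2]*Kc + C[3]) mod S.
    """
    c1, c2, c3 = key
    return (c1 * num_clusters ** 2 + c2 * num_clusters + c3) % table_size


# C(A) = {C1 C5 C2} with ten clusters and an example table size of 1009.
idx = f_key((1, 5, 2), num_clusters=10, table_size=1009)
```

Under these assumptions the unreduced value is 1*100 + 5*10 + 2 = 152, which is already smaller than the table size, so `idx` is 152.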

Once IDX has been determined in step s1, the processing
proceeds to step s3 where the system checks the
corresponding entry in the table, A, and determines
whether or not the key stored in that entry (KEY) is the
same as the key for the current phoneme cluster trigram
(key) or is the null key (indicating that this entry is
empty). If in step s3 the system determines that the key
stored in the entry (KEY) matches the current key (key)
or the null key, then the processing proceeds to step s5
where a pointer is added to that entry of the table, A,
which points to the node associated with the first
phoneme in the phoneme trigram associated with the
current input key. For example, for the key
$C(a_0^{di})\,C(a_1^{di})\,C(a_2^{di})$, the data which would be
added to the table
would be a pointer which points to the node $n_0^{di}$ since
annotation phoneme $a_0^{di}$ is associated with node $n_0^{di}$. If
the key stored in the entry is currently the null key,
then in step s5, the system also changes the key for the
entry (KEY) to the current key (key). The processing of
the current cluster trigram then ends and a similar
processing is performed on the next cluster trigram. If
at step s3, the processing determines that the IDX-th
entry in the table, A, has already been assigned to a
different cluster trigram, then the processing proceeds
to step s7 where the system tries another entry in the
table by changing the value of IDX in some predetermined
way. In this embodiment, this is achieved by
calculating:

$$IDX = (IDX + V) \bmod S \qquad (2)$$

where V is some fixed number which is not a factor of S
(other than 1). The reason that V should not be a factor
of S is that this ensures that all entries in the table
are tried. For example, if S = 10 and V = 2 then this
technique would simply keep trying either just the odd or
just the even entries of table A. In order to avoid this
problem, S should preferably be prime. After step s7,
the processing returns to step s3.
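
The insertion procedure of steps s1 to s7 can be sketched in Python as follows. The table is modelled as a list whose entries are either `None` (the null key) or a `(key, pointers)` pair; this representation, the function name, the default step of 7 and the assumed descending-power form of the hash are all choices made for the example. Growing the table when it becomes too full is omitted.

```python
def add_to_index(table, key, pointer, num_clusters=10, step=7):
    """Add `pointer` under cluster trigram `key`, probing on collisions.

    Returns the index of the entry that was used.
    """
    size = len(table)
    c1, c2, c3 = key
    idx = (c1 * num_clusters ** 2 + c2 * num_clusters + c3) % size  # step s1
    for _ in range(size):                 # bound added so a full table terminates
        entry = table[idx]                # step s3
        if entry is None:                 # null key: claim the entry (step s5)
            table[idx] = (key, [pointer])
            return idx
        if entry[0] == key:               # same key: append the pointer (step s5)
            entry[1].append(pointer)
            return idx
        idx = (idx + step) % size         # different key: try elsewhere (step s7)
    raise RuntimeError("index table is full")
```

Strictly, the probe sequence visits every entry when the step and the table size share no common factor; a prime table size guarantees this for any step that is not a multiple of it, which is why the example uses a table of 11 entries with a step of 7.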

Once all the cluster trigrams in all the blocks of each
annotation have been processed in this way, the table, A,
is stored as the phoneme index 35. Figure 5 illustrates
the form of a phoneme index that is generated by the
above processing. As can be seen from Figure 5, each
entry in the table includes the index number (IDX) of the
entry, the key (KEY) associated with the entry and (if
the key is not the null key) one or more pointers
pointing to nodes in the annotation database 31. As
shown, in this embodiment, these pointers have the form
n[p,q,r] where p identifies the annotation for document p,
q is the q-th block of nodes within that annotation data
and r is the r-th node within that block.



The way that the phoneme matcher 33 uses the phoneme 
index 35 in response to an input query in order to 
identify portions of the annotation database 31 for 
matching with the input query will now be described with 
reference to Figures 6 to 8. 

When a user inputs a query the phoneme data generated 
either by the automatic speech recognition unit 51 or the 
phonetic transcription unit 75 (depending upon whether 
the input query was received through the microphone or 
through the keyboard) is input to the phoneme matcher 33. 
Figure 6 illustrates the processing steps performed by 
the phoneme matcher 33 on this phoneme data. As shown,
in step s11, the phoneme matcher 33 converts the received
query phoneme data into overlapping phoneme trigrams.
Figure 7a illustrates a sequence of query phonemes $q_0$ to
$q_5$ representative of phoneme data received by the phoneme
matcher 33 and Figure 7b illustrates how this sequence of
query phonemes is converted into five overlapping
trigrams of query phonemes 103. The processing then
proceeds to step s13 where each of the query phoneme
trigrams is converted into phoneme cluster trigrams 105,
as illustrated in Figure 7c, by classifying each of the
query phonemes in the phoneme trigram into one of the
above classes or clusters. The processing then proceeds
to step s15 where the phoneme matcher 33 uses each of the
cluster trigrams generated for the input query to address
the phoneme index 35 in order to identify relevant
locations in the annotation database 31.

Figure 8 illustrates the processing steps used by the
phoneme matcher 33 in carrying out step s15. As shown,
in step s151, the phoneme matcher 33 calculates the index
(IDX) for the entry in the table, A, by inserting the
current query cluster trigram into the function defined
above in equation (1). The processing then proceeds to
step s153 where the phoneme matcher 33 checks the
corresponding entry in the table, A, and determines
whether or not the key stored in that entry (KEY) is the
null key (indicating that this entry is empty). If it
is, then the processing proceeds to step s155 where the
phoneme matcher 33 determines that there is no
corresponding annotation in the annotation database and
outputs an appropriate output to the combiner unit 40.
The processing then ends.

If at step s153 the phoneme matcher 33 determines that
the key stored in the entry (KEY) is not equal to the
null key, the processing proceeds to step s157 where the
phoneme matcher 33 determines whether or not the key
stored in the entry (KEY) is the same as the key for the
current query cluster trigram (key). If it is then the
processing proceeds to step s159 where the phoneme
matcher 33 retrieves the pointers from that entry. The
processing then ends. If, however, the phoneme matcher
33 determines, in step s157, that the key for the entry
(KEY) does not equal the key for the current query
cluster trigram (key), then the processing proceeds to
step s161 where the phoneme matcher tries another entry
in the table by changing the value of the index (IDX)
using equation (2) given above and then returning to step
s153.
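
The lookup of steps s151 to s161 mirrors the insertion procedure and might be sketched in Python as below. As before, the table representation (a list of `None` or `(key, pointers)` entries), the function name, the step of 7 and the assumed form of the hash are illustrative assumptions; the lookup must use the same hash and step as were used when the index was built.

```python
def lookup(table, key, num_clusters=10, step=7):
    """Return the pointers stored for cluster trigram `key`, or [] if none."""
    size = len(table)
    c1, c2, c3 = key
    idx = (c1 * num_clusters ** 2 + c2 * num_clusters + c3) % size  # step s151
    for _ in range(size):             # bound added so a full table terminates
        entry = table[idx]            # step s153
        if entry is None:             # null key: nothing indexed (step s155)
            return []
        if entry[0] == key:           # keys match: return pointers (steps s157, s159)
            return entry[1]
        idx = (idx + step) % size     # different key: probe again (step s161)
    return []
```

An empty result corresponds to the case where the phoneme matcher reports that no corresponding annotation exists for the current query cluster trigram.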

Once the phoneme matcher 33 retrieves the pointers from
the index or determines that there is no data stored for
the current query cluster trigram, the phoneme matcher 33
then performs a similar processing for the next query
cluster trigram until all the query cluster trigrams have
been processed in this way. The processing then proceeds
to step s17 shown in Figure 6, where the phoneme matcher
33 uses the pointers identified in step s15 to identify
regions within the annotation database 31 which will be
matched with the actual phoneme data received by the
phoneme matcher 33. In this embodiment, these regions
are identified by comparing the pointers retrieved in
step s159 for successive query cluster trigrams and
looking for pointers which point to portions in the
annotation database 31 which are next to each other. For
example, referring to the phoneme index illustrated in
Figure 5, if the n-th query cluster trigram is C5C3C6 and
the (n+1)-th query cluster trigram is C3C6C1, then the phoneme
matcher 33 will identify node 40 of the 32nd block of
annotation data for the 3rd document as being a region of
the annotation database for further processing in step
s19. This is because the pointers stored in the phoneme
index 35 for the key C5C3C6 include a pointer to node
n[3,32,40] and the pointers stored in the phoneme index
35 for the key C3C6C1 include a pointer to node
n[3,32,41], which is immediately after node n[3,32,40]
and is therefore consistent with a portion of an
annotation having successive cluster trigrams C5C3C6 and
then C3C6C1 which may match with the input query.
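
The adjacency test can be sketched in Python as follows. Pointers are modelled as `(document, block, node)` tuples corresponding to n[p,q,r]; the function name is hypothetical, and for brevity the sketch only treats node r+1 of the same block as adjacent, so a pair of pointers that straddles a block boundary would need extra handling.

```python
def candidate_regions(pointer_lists):
    """Find pointers whose successor node is hit by the next query trigram.

    `pointer_lists[n]` holds the pointers retrieved from the index for
    the n-th query cluster trigram.
    """
    regions = []
    for current, following in zip(pointer_lists, pointer_lists[1:]):
        following_set = set(following)
        for doc, block, node in current:
            # Adjacent means: same document, same block, next node.
            if (doc, block, node + 1) in following_set:
                regions.append((doc, block, node))
    return regions
```

With the pointers from the example, `candidate_regions([[(3, 32, 40)], [(3, 32, 41)]])` yields the single region `(3, 32, 40)`.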

After the phoneme matcher 33 has identified the regions
in step s17, it performs a phoneme comparison between the
received query phoneme data and the phoneme data stored
in the annotation database 31 at the regions identified
in step s17. This phoneme comparison can be performed by
comparing M-grams of the query with similar M-grams of
the annotation (as described in the applicant's earlier
UK application GB 9905201.1, the content of which is
incorporated herein by reference) or by performing a
dynamic programming comparison between the sequence of
query phonemes and the sequence of annotation phonemes
(using, for example, one of the techniques described in
the applicant's earlier UK application GB 9925574.7, the
content of which is incorporated herein by reference).
The results of these phoneme comparisons are then output
to the combiner unit 40 where the results are combined
with the output from the word matcher 37 in order to
retrieve and rank the appropriate documents from the
document database 29.

In the above description, it has been assumed that the
phonemes have been classified into a number of sets of
confusable phonemes. The way that these phoneme clusters
are determined in this embodiment will now be described.
If two phoneme decodings have been made, once during the
annotation phase and once during the query phase, then
the probability of the two decodings, $p_1$ and $p_2$, coming
from the same source is given by:

$$P(p_1, p_2 \mid m_1, m_2) = \sum_x P(p_1 \mid x, m_1)\, P(p_2 \mid x, m_2)\, P(x) \qquad (3)$$

where P(p|x,m) is the probability of decoding phoneme x
as phoneme p when decoding method m is used and P(x) is
the probability of phoneme x occurring. The decoding
methods $m_1$ and $m_2$ need to be distinguished since one of
the decodings may come from a text-to-phoneme converter
within the phonetic transcription unit 75 whilst the
other may come from the automatic speech recognition unit
51 and these two different decoding techniques will
suffer from different types of confusion. Each of these
probabilities in equation (3) above can be determined in
advance during a training routine by applying known
speech to the automatic speech recognition unit 51 and
known text into the phonetic transcription unit 75 and by
monitoring the output from the recognition unit 51 and
the phonetic transcription unit 75 respectively. The way
that such a training routine would be performed will be
well known to those skilled in the art and will not,
therefore, be described further here.

It is assumed that there are $K_c$ phoneme clusters or
classes and that there is a many-to-one
mapping between decoded phonemes and the clusters. This
mapping will depend on the decoding method employed. For
example, an /s/ decoded by the automatic speech
recognition unit 51 may be in cluster C4, but an /s/
decoded by the phonetic transcription unit 75 may be in
cluster C2. Therefore, for any particular cluster ($K_i$)
there will be a probability that two decodings from the
same source are classed into that cluster which is
determined by summing the probability given in equation
(3) above for all possible combinations of decodings, $p_1$
and $p_2$, which belong to that cluster, i.e. by
calculating:

$$P(p_1, p_2 \in K_i) = \sum_{p_1 \in K_i} \sum_{p_2 \in K_i} P(p_1, p_2 \mid m_1, m_2) \qquad (4)$$

The probability that all decodings are correctly
classified is therefore given by:

$$P(\text{all correctly classified}) = \sum_{i=1}^{K_c} P(p_1, p_2 \in K_i) \qquad (5)$$



The task of defining the clusters aims, therefore, to
maximise P(all correctly classified), subject to the
constraints that each phoneme (via a particular decoding
method) is in one and only one cluster. In this
embodiment, a Monte Carlo algorithm is used in order to
determine phoneme classifications which maximise this
probability.
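
The quantity being maximised can be computed as in the following Python sketch. It assumes the confusion statistics are available as nested dictionaries, `conf[x][p]` holding P(p|x,m) for one decoding method, together with a prior `prior[x]` holding P(x); these data structures and the function names are hypothetical. The cluster assignments are given separately for each decoding method, since the mapping depends on the method.

```python
def same_source_prob(p1, p2, conf1, conf2, prior):
    """Probability that decodings p1 (method 1) and p2 (method 2) share a source.

    Sums P(p1|x,m1) * P(p2|x,m2) * P(x) over all source phonemes x.
    """
    return sum(conf1[x].get(p1, 0.0) * conf2[x].get(p2, 0.0) * prior[x]
               for x in prior)


def prob_all_correct(assign1, assign2, conf1, conf2, prior):
    """Probability that two decodings of one source fall in the same cluster.

    `assign1` and `assign2` map decoded phonemes to cluster numbers for
    decoding methods 1 and 2 respectively.
    """
    total = 0.0
    for p1, c1 in assign1.items():
        for p2, c2 in assign2.items():
            if c1 == c2:   # only pairs sharing a cluster contribute
                total += same_source_prob(p1, p2, conf1, conf2, prior)
    return total
```

Placing every phoneme in a single cluster drives this probability to one but removes all discrimination, which is why the number of clusters is fixed before the assignment is optimised.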






Figure 9a illustrates the steps involved in this Monte
Carlo algorithm. Initially, in step s200, the number of
phoneme clusters that will be used is determined. As
those skilled in the art will appreciate, if there are
too few clusters then there will be insufficient
discrimination and if there are too many clusters then
the data may not be retrievable. In this embodiment, in
order to provide classifications which are sufficiently
discriminative, ten clusters are defined. Once the
number of clusters has been determined, the system
randomly assigns, in step s201, phonemes to these
clusters and stores this as a current configuration. The
processing then proceeds to step s203 where the system
determines the probability that the phonemes are
correctly classified in the clusters for the current
configuration, i.e. the system calculates the probability
given in equation (5) above.

The processing then proceeds to step s205 where the
system randomly selects a phoneme and a target cluster to
which the selected phoneme may be moved. Then, in step
s207, the system calculates what the probability given in
equation (5) would be if the selected phoneme were moved
into the target cluster. Then, in step s209, the system
compares this new probability calculated in step s207
with the probability for the current configuration which
was calculated in step s203. If the new probability is
higher than the probability for the current
configuration, then the processing passes to step s211
where the system moves the selected phoneme to the target
cluster to replace the current configuration. The
processing then proceeds to step s213 where the system
determines whether or not the probability calculated for
the new configuration is better than the "best ever"
probability. If it is, then the processing proceeds to
step s215 where the system stores the current
configuration as the best ever configuration that it has
encountered. Otherwise step s215 is skipped and the
processing proceeds to step s219 where the system
determines whether or not convergence has been reached.

If, at step s209, the system determines that the new
probability is not higher than the probability for the
current configuration, then the processing proceeds to
step s217 where the system moves the selected phoneme to
the target cluster to replace the current configuration
with a probability dependent upon the difference between
the new probability and the probability for the current
configuration. For example, a difference probability can
be defined as:



where s1 is the probability determined for the current
configuration, s2 is the probability for the proposed
new configuration and λ is a parameter which can be tuned
to get best performance. Therefore, when the two
probabilities are the same d = 1/2, and when the
probability for the new configuration is massively worse
than the probability for the current configuration d will
be approximately equal to zero. Then, if a random number
generator is used to pick a number r between zero and
one, the proposed configuration will replace the current
configuration if d > r; otherwise the proposed
configuration is discarded.

After step s217, the system proceeds to step s219 where
it determines whether or not convergence has been
reached. If it has, then the processing ends and the
phoneme clusters are stored. If convergence has not been
reached, then the processing proceeds to step s221 where
a random value (RV) between zero and one is chosen and
then compared with a threshold (Th) in step s223. If the
random value, RV, is less than the threshold then the
processing returns to step s205 above and the procedure
is repeated. On the other hand, if the random value, RV,
is not less than the threshold, then the processing
proceeds to step s225 where the best ever configuration
is copied into the current configuration and then the
processing again returns to step s205. However, setting
the threshold Th to be close to one ensures that the best
ever configuration is only copied into the current
configuration very occasionally. As those skilled in the
art will appreciate, the processing of steps s221 to s225
is provided to try and ensure that the system does not
remain stuck in a local minimum.
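
By way of illustration only, the flow of steps s201 to s225 might be sketched as follows. The sketch makes several assumptions that are not taken from the embodiment: the acceptance probability d is given a logistic form chosen only because it equals 1/2 when the two probabilities are equal and tends to zero when the proposal is much worse; a fixed iteration budget stands in for the convergence test of step s219; and the function name, parameter names, default values and toy scoring function are hypothetical.

```python
import math
import random


def monte_carlo_clusters(phonemes, n_clusters, score, lam=50.0,
                         threshold=0.99, iterations=2000, seed=0):
    """Search for a phoneme-to-cluster assignment that maximises score()."""
    rng = random.Random(seed)
    # s201: random initial configuration (phoneme -> cluster number).
    current = {p: rng.randrange(n_clusters) for p in phonemes}
    current_score = score(current)                          # s203
    best, best_score = dict(current), current_score
    for _ in range(iterations):  # fixed budget stands in for s219
        phoneme = rng.choice(phonemes)                      # s205
        proposal = dict(current)
        proposal[phoneme] = rng.randrange(n_clusters)
        new_score = score(proposal)                         # s207
        if new_score > current_score:                       # s209
            current, current_score = proposal, new_score    # s211
            if current_score > best_score:                  # s213
                best, best_score = dict(current), current_score  # s215
        else:
            # s217: accept with probability d (assumed logistic form).
            d = 1.0 / (1.0 + math.exp(lam * (current_score - new_score)))
            if d > rng.random():
                current, current_score = proposal, new_score
        if rng.random() >= threshold:                       # s221, s223
            current, current_score = dict(best), best_score  # s225
    return best, best_score


# Toy scoring function: product over two clusters of summed pair probabilities.
JOINT = {("s", "s"): 0.2, ("s", "z"): 0.1, ("z", "s"): 0.1, ("z", "z"): 0.1,
         ("m", "m"): 0.2, ("m", "n"): 0.1, ("n", "m"): 0.1, ("n", "n"): 0.1}


def toy_score(assignment, n_clusters=2):
    total = 1.0
    for c in range(n_clusters):
        members = [p for p, k in assignment.items() if k == c]
        total *= sum(JOINT.get((a, b), 0.0) for a in members for b in members)
    return total


best, best_score = monte_carlo_clusters(["s", "z", "m", "n"], 2, toy_score)
```

With the threshold close to one, the copy of the best ever configuration in step s225 happens only rarely, so the search mostly continues from the current configuration.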



The inventors have found that performing this clustering
algorithm for the English phoneme set with ten clusters
and for a given user, when the decodings come from the
automatic speech recognition unit 51, gives the following
phoneme clusters:

C1     {aa ae ah aw eh ey uw}
C2     {ao l oy r w}
C3     {d dh t th}
C4     {ax ea er hh oh ow ua}
C5     {m sil}
C6     {b f p v}
C7     {s z}
C8     {ch g jh k sh zh}
C9     {n ng}
C10    {ay ia ih iy uh y}

and when the decodings come from the phonetic
transcription unit 75, gives the following phoneme
clusters:

C1     {aa ae ah aw eh ey uh uw}
C2     {ao l oy r w}
C3     {d dh t th}
C4     {ax ea er hh oh ow ua}
C5     {m sil}
C6     {b f p v}
C7     {s z}
C8     {ch g jh k sh zh}
C9     {n ng}
C10    {ay ia ih iy y}
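
By way of illustration only, the clusters listed above for the automatic speech recognition unit 51 might be applied to a decoded phoneme sequence as in the following sketch, which maps each phoneme to its cluster number and then forms cluster trigrams. The names `ASR_CLUSTERS`, `PHONEME_TO_CLUSTER` and `cluster_trigrams`, the use of overlapping trigrams and the example phoneme sequence are assumptions made for the example.

```python
# Cluster membership copied from the list given for recognition unit 51.
ASR_CLUSTERS = {
    1: "aa ae ah aw eh ey uw",
    2: "ao l oy r w",
    3: "d dh t th",
    4: "ax ea er hh oh ow ua",
    5: "m sil",
    6: "b f p v",
    7: "s z",
    8: "ch g jh k sh zh",
    9: "n ng",
    10: "ay ia ih iy uh y",
}

# Many-to-one mapping: every phoneme belongs to one and only one cluster.
PHONEME_TO_CLUSTER = {
    phoneme: number
    for number, members in ASR_CLUSTERS.items()
    for phoneme in members.split()
}


def cluster_trigrams(phonemes):
    """Classify each phoneme, then return the overlapping cluster trigrams."""
    classes = [PHONEME_TO_CLUSTER[p] for p in phonemes]
    return [tuple(classes[i:i + 3]) for i in range(len(classes) - 2)]


# Hypothetical decoding used only to exercise the mapping.
trigrams = cluster_trigrams(["p", "ih", "k", "ch", "er"])
```

Because confusable phonemes share a cluster, a decoding error that swaps, say, /k/ for /g/ leaves the cluster trigrams unchanged.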



As those skilled in the art will appreciate, the phoneme
clusters for the text to phoneme transcription unit 75
are predominantly the same as those for the automatic
speech recognition unit 51, with the exception of the
"uh" phone, which is in cluster C1 for the transcription
unit 75 while it is in cluster C10 for the clusters of
the automatic speech recognition unit 51. As those
skilled in the art will appreciate, the clusters given
above are given by way of example only. The precise
clusters that are used will depend on the matching method
used to compare the phonemes in the clusters.

ALTERNATIVE EMBODIMENTS

In the above embodiment, the document database 29, the
annotation database 31 and the speech recognition unit 51
were all located within the user terminal 59. As those
skilled in the art will appreciate, this is not
essential. Figure 10 illustrates an embodiment in which
the document database 29 and the search engine 53 are
located in a remote server 60 and in which the user
terminal 59 accesses the database 29 via the network
interface units 67 and 69 and a data network 68 (such as
the Internet). In this embodiment, both the documents
and the annotations are stored in the database 29. In
this embodiment, the user terminal 59 can only receive
voice queries from the microphone 7. These queries are
converted into phoneme and word data by the automatic
speech recognition unit 51. This data is then passed to
the control unit 55 which controls the transmission of
data over the data network 68 to the search engine 53
located within the remote server 60. The search engine
53 then uses the phoneme index to carry out a search in
the database 29 in a similar manner to the way in which
the search was performed in the above embodiment. The
results of the search are then transmitted back from the
search engine 53 to the control unit 55 via the data
network 68. The control unit 55 then considers the
search results received back from the network and
displays appropriate data on the display 57 for viewing
by the user 39.

In addition to locating the database 29 and the search
engine 53 in the remote server 60, it is also possible to
locate the automatic speech recognition unit 51 in the
remote server 60. Such an embodiment is shown in Figure
11. As shown, in this embodiment, the input voice query
from the user is passed via input line 61 to a speech
encoding unit 73 which is operable to encode the speech
for efficient transfer through the data network 68. The
encoded data is then passed to the control unit 55 which
transmits the data over the network 68 to the remote
server 60, where it is processed by the automatic speech
recognition unit 51. In this embodiment, the speech
recognition unit is operable to generate only phoneme
data, which is then passed to the search engine for use
in searching the database 29 using the phoneme index 35.
The search results generated by the search engine are
then passed, via the network interface 69 and the network
68, back to the user terminal 59. The search results
received back from the remote server are then passed via
the network interface unit 67 to the control unit 55
which analyses the results and generates and displays
appropriate data on the display 57 for viewing by the
user 39.

In a similar manner, a user terminal 59 may be provided 
which only allows typed inputs from the user and which 
has the search engine and the database located in the 
remote server. In such an embodiment, the phonetic 
transcription unit 75 may be located in the remote server 
60 as well. 



In the above embodiment, the annotation database and the
document database were separate. As those skilled in the
art will appreciate, in some embodiments, the annotation
database and the document database may form a single
database. Additionally, the annotations in the
annotation database may form the data to be retrieved.

In the above embodiment, the annotation database was
searched using both phonemes and words. As those skilled
in the art will appreciate, the phoneme index described
above and its use in searching the annotation database
may be used in a system which does not search using words
as well.






In the above embodiment, a phoneme index was described.
As those skilled in the art will appreciate, the above
technique can be used for features other than phonemes,
such as phones, syllables or katakana (Japanese alphabet)
or any other sub-word unit of speech. This indexing
technique could also be used in other applications, such
as the indexing of DNA sequences and the like.

In the above embodiment, a hash indexing technique has 
been described. As those skilled in the art will 
appreciate, other indexing techniques can be used in 
order to identify portions of the database for carrying 
out a detailed phoneme search using phoneme data 
generated from the input query. 

In the above embodiment, the function defined in equation
(1) above was used to define the mapping between the
cluster trigram and the entry in the table. However,
with this technique, if C[1] or C[2] or C[3] equals zero
then this will yield zero. Instead, the function used
could be:

$$\left(C[3] + K_c\left(C[2] + K_c\,C[1]\right)\right) \bmod S \qquad (7)$$

or, for a general N-gram of length n:

$$\left(\sum_{i=1}^{n} K_c^{\,n-i}\,C[i]\right) \bmod S \qquad (8)$$
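
By way of illustration only, the alternative mapping of equations (7) and (8) might be computed as in the following sketch, which evaluates the weighted sum in nested form. The function name `ngram_index`, the zero-based class numbers and the example values of Kc and S are assumptions made for the example.

```python
def ngram_index(classes, k_c, table_size):
    """Map a sequence of class numbers C[1..n] to a table entry number."""
    index = 0
    for c in classes:
        # Nested evaluation: after the loop, index = sum of K_c**(n-i) * C[i].
        index = index * k_c + c
    return index % table_size


K_C = 10    # number of clusters (as in the embodiment)
S = 101     # hypothetical number of entries in the table

first = ngram_index((5, 9, 7), K_C, S)
# A class number of zero no longer forces the result to zero.
with_zero = ngram_index((0, 9, 7), K_C, S)
```

For a trigram, the loop computes the same value as the nested expression of equation (7), so the two forms can be used interchangeably.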



In the above embodiment, the pointers stored in the index
were of the form n[p,q,r], where p is the annotation for
document p, q is the q-th block of nodes within that
annotation and r is the r-th node within that block. As
those skilled in the art will appreciate, other types of
pointers could be used. For example, the count of the
node since the start of the lattice or the time and rank
(where more than one node has the same time) of the
relevant nodes could be used.

In the above embodiment, when the query is being applied
to the index, each of the query cluster trigrams was
applied to the index and the appropriate entries found.
These entries were then compared in order to identify
portions of the phoneme lattice for further searching.
Alternatively, the processing of the query cluster
trigrams may be performed incrementally, i.e. by
identifying all the entries for the first query trigram,
then obtaining those for the second query trigram and
retaining those which are close in time to those of the
first trigram, and so on.

In the above embodiment, when identifying regions where
successive cluster trigrams occur close in time, the
system compared the node numbers in the pointers which
are retrieved. However, in some embodiments, it is
better to compare the time offsets stored for the nodes.
This is because in some applications, the lattice will
have many branches and two nodes which are very close in
time could have completely different node numbers. The
aim of this part of the system is to identify cluster
trigrams which are in chronological order within the
annotation and which occur within a limited period of
time of each other. In this way, if there is an error in
a trigram, it will only result in two to three trigram
misses. Consequently, the time leeway should be
comparable to about four or five phonemes, which is about
0.2 to 0.3 seconds.
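
By way of illustration only, the incremental, time-based comparison described above might be sketched as follows. Each query cluster trigram is assumed to have already been looked up in the index, yielding a list of time offsets in seconds; a candidate region is extended whenever a later trigram has a hit that follows the region's last hit within the leeway. The function name, the list-of-lists input, the `min_matches` parameter and the example times are assumptions made for the example.

```python
from bisect import bisect_right


def candidate_regions(hits_per_trigram, leeway=0.3, min_matches=2):
    """Return (start, end, matches) for chronologically ordered hit chains."""
    regions = []  # each region is [start_time, last_hit_time, match_count]
    for hits in hits_per_trigram:
        hits = sorted(hits)
        used = set()
        for region in regions:
            # First hit strictly later than the region's last matched hit.
            i = bisect_right(hits, region[1])
            if i < len(hits) and hits[i] - region[1] <= leeway:
                region[1] = hits[i]
                region[2] += 1
                used.add(hits[i])
        # Hits that extended no existing region may start a new region.
        regions.extend([t, t, 1] for t in hits if t not in used)
    return [(s, e, m) for s, e, m in regions if m >= min_matches]


# Hypothetical time offsets for four successive query trigrams; the third
# trigram has no entry, as might happen after a decoding error.
hits = [[1.00, 7.50], [1.06], [], [1.31, 9.00]]
regions = candidate_regions(hits)
```

Because the leeway is measured from the last matched trigram, the missing third trigram does not break the chain that starts at 1.00 seconds, whereas the isolated hits at 7.50 and 9.00 seconds are discarded.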



CLAIMS:

1. An apparatus for identifying one or more portions of
data in a database for comparison with a query input by 
a user, the query and the portions of data each 
comprising a sequence of sub-word units, the apparatus 
comprising: 

a memory for storing data defining a plurality of 
sub-word unit classes, each class comprising sub-word 
units that are confusable with other sub-word units in 
the same class; 

a memory for storing an index having a plurality of 
entries, each of which comprises: 

(i) an identifier for identifying the entry; 

(ii) a key associated with the entry and which is 
related to the identifier for the entry in a 
predetermined manner; and 

(iii) a number of pointers which point to portions 
of data in the database which correspond to the key for 
the entry; 

wherein each key comprises a sequence of sub-word 
unit classifications which is derived from a 
corresponding sequence of sub-word units appearing in the 
database by classifying each of the sub-word units in the 
sequence into one of the plurality of sub-word unit
classes;

means for classifying each of the sub-word units in 
the input query into one of the plurality of sub-word 
unit classes and for defining one or more sub-sequences 
of query sub-word unit classifications;

means for determining a corresponding identifier for 
an entry in said index for each of said one or more sub- 
sequences of query sub-word unit classifications; 

means for comparing the key associated with each of 
the determined identifiers with the corresponding sub-
sequence of query sub-word unit classifications; and

means for retrieving one or more pointers from said 
index in dependence upon the output of said comparing 
means, which one or more pointers identify said one or 
more portions of data in the database for comparison with
the input query.

2. An apparatus according to claim 1, wherein said sub- 
word units are phonemes or phoneme-like units. 


3. An apparatus according to claim 1 or 2, wherein at 
least ten sub-word unit classes are defined in advance. 






4. An apparatus according to any preceding claim,
wherein each key is related to the corresponding
identifier by a predetermined mathematical function.

5. An apparatus according to claim 4, wherein each key
is related to the corresponding identifier by the
following equation:

$$\left(\prod_{i=1}^{W} \left[\,C[i]\,K_c\,\right]\right) \bmod S$$

where Kc is the number of sub-word unit classes, S
is the number of entries in the index, C[i] is the number
of the sub-word class to which the i-th sub-word unit in
the sequence of sub-word units corresponding to the key
belongs and W is the number of sub-word unit
classifications in each key.


6. An apparatus according to any preceding claim, 
wherein said determining means is operable to identify a 
new identifier for another entry in said index for a 
subsequence of query sub-word unit classifications if 
said comparing means determines that the key for the
identifier is not the same as the subsequence of query
sub-word unit classifications.

7. An apparatus according to claim 6, wherein said
determining means is operable to determine a new
identifier using the following equation:

IDX = [IDX + V] Mod S

where IDX is the identifier, S is the number of
entries in the index and V is a predetermined number.

8. An apparatus according to any preceding claim, 
wherein the key for one or more of said entries is a null 
key indicating that there are no pointers stored in the 
index for that entry.

9. An apparatus according to claim 4 or 5, wherein said 
determining means is operable to determine a 
corresponding identifier for each subsequence of query 
sub-word unit classifications using said predetermined 
mathematical function. 

10. An apparatus according to any preceding claim, 
wherein said input query is a typed query and wherein the 
apparatus further comprises means for converting the 
typed query into said sequence of sub-word units. 

11. An apparatus according to any of claims 1 to 9,
wherein said input query is a spoken query and wherein
the apparatus further comprises a speech recognition
system for processing the spoken query and for outputting
said sequence of sub-word units.

12. An apparatus for searching a database in response to
a query input by a user, the database comprising a
plurality of sequences of sub-word units and the query
comprising at least one sequence of sub-word units, the
apparatus comprising:

an apparatus according to any of claims 1 to 11 for 
identifying one or more portions of data in the database 
for comparison with the input query; and 

means for comparing the one or more sequences of 
query sub-word units with the identified one or more 
portions of data in said database. 

13. An apparatus according to claim 12, wherein said 
means for comparing said input query with said portions 
of data in the database uses a dynamic programming 
comparison technique. 

14. An apparatus according to claim 12 or 13, further 
comprising means for retrieving one or more data files in 
dependence upon the results of said comparing means.

15. An apparatus for identifying one or more portions of
data in a database for comparison with a query input by
a user, the query and the portions of data each
comprising a sequence of features, the apparatus
comprising:

a memory for storing data defining a plurality of 
feature classes, each class comprising features that are 
confusable with other features in the same class;

a memory for storing an index having a plurality of 
entries, each of which comprises:

(i) an identifier for identifying the entry; 

(ii) a key associated with the entry and which is 
related to the identifier for the entry in a 
predetermined manner; and 

(iii) a number of pointers which point to portions 
of data in the database which correspond to the key for 
the entry; 

wherein each key comprises a sequence of feature 
classifications which is derived from a corresponding 
sequence of features appearing in the database by
classifying each of the features in the sequence into one
of the plurality of feature classes; 

means for classifying each of the features in the 
input query into one of the plurality of feature classes 
and for defining one or more sub-sequences of query
feature classifications;

means for determining a corresponding identifier for 
an entry in said index for each of said one or more sub- 
sequences of query feature classifications; 

means for comparing the key associated with each of 
the determined identifiers with the corresponding sub- 
sequence of query feature classifications; and 

means for retrieving one or more pointers from said 
index in dependence upon the output of said comparing 
means, which one or more pointers identify said one or 
more portions of data in the database for comparison with 
the input query. 

16. Data defining an index for use in searching a 
database, the data comprising: 

data defining a respective identifier for each of a 
plurality of entries in the index; 

data defining a respective key for each of the 
plurality of entries, which keys are related to the 
corresponding identifiers in a predetermined manner; and 

data defining a respective one or more pointers for 
a plurality of the entries, which pointers point to 
locations within the database corresponding to the key 
for the entry; 

wherein each key comprises a sequence of sub-word
unit classifications which is derived from a
corresponding sequence of sub-word units appearing in the
database by classifying each of the sub-word units in the
sequence into one of a plurality of sub-word unit
classes, the sub-word unit classes being defined in
advance and each comprising sub-word units that are
confusable with other sub-word units in the same class.

17. A method of identifying one or more portions of data 
in a database for comparison with a query input by a 
user, the query and the portions of data each comprising 
a sequence of sub-word units, the method comprising the 
steps of: 

storing data defining a plurality of sub-word unit 
classes, each class comprising sub-word units that are 
confusable with other sub-word units in the same class;

storing an index having a plurality of entries, each 
of which comprises: 

(i) an identifier for identifying the entry; 

(ii) a key associated with the entry and which is 
related to the identifier for the entry in a 
predetermined manner; and 

(iii) a number of pointers which point to portions 
of data in the database which correspond to the key for
the entry;

wherein each key comprises a sequence of sub-word
unit classifications which is derived from a
corresponding sequence of sub-word units appearing in the
database by classifying each of the sub-word units in the
sequence into one of the plurality of sub-word unit
classes;

classifying each of the sub-word units in the input 
query into one of the plurality of sub-word unit classes 
and for defining one or more sub-sequences of query sub-
word unit classifications;

determining a corresponding identifier for an entry 
in said index for each of said one or more sub-sequences 
of query sub-word unit classifications; 

comparing the key associated with each of the 
determined identifiers with the corresponding sub-
sequence of query sub-word unit classifications; and

retrieving one or more pointers from said index in 
dependence upon the output of said comparing step, which 
one or more pointers identify said one or more portions 
of data in the database for comparison with the input 
query.

18. A method according to claim 17, wherein said sub-
word units are phonemes or phoneme-like units.

19. A method according to claim 17 or 18, wherein at
least ten sub-word unit classes are defined in advance.

20. A method according to any of claims 17 to 19, 
wherein each key is related to the corresponding 
identifier by a predetermined mathematical function. 

21. A method according to claim 20, wherein each key is
related to the corresponding identifier by the following
equation:

$$\left(\prod_{i=1}^{W} \left[\,C[i]\,K_c\,\right]\right) \bmod S$$

where Kc is the number of sub-word unit classes, S
is the number of entries in the index, C[i] is the number
of the sub-word class to which the i-th sub-word unit in
the sequence of sub-word units corresponding to the key
belongs and W is the number of sub-word unit
classifications in each key.

22. A method according to any of claims 17 to 21, 
wherein said determining step identifies a new identifier 
for another entry in said index for a subsequence of 
query sub-word unit classifications if said comparing 
step determines that the key for the identifier is not
the same as the subsequence of query sub-word unit
classifications.



23. A method according to claim 22, wherein said 
determining step determines a new identifier using the 
following equation: 

IDX = [ IDX + V] Mod S 

where IDX is the identifier, S is the number of 
entries in the index and V is a predetermined number. 

24. A method according to any of claims 17 to 23, 
wherein the key for one or more of said entries is a null 
key indicating that there are no pointers stored in the 
index for that entry. 

25. A method according to claim 23 or 24, wherein said 
determining step determines a corresponding identifier 
for each subsequence of query sub-word unit 
classifications using said predetermined mathematical 
function.

26. A method according to any of claims 17 to 25, 
wherein said input query is a typed query and wherein the
method further comprises the step of converting the typed
query into said sequence of sub-word units. 

27. A method according to any of claims 17 to 25, 
wherein said input query is a spoken query and wherein
the method further comprises the step of using a speech 
recognition system to process the spoken query to 
generate said sequence of subword units. 

28. A method of searching a database in response to a
query input by a user, the database comprising a 
plurality of sequences of sub-word units and the query 
comprising at least one sequence of sub-word units, the 
method comprising: 

the method steps of any of claims 17 to 27 for 
identifying one or more portions of data in the database 
for comparison with the input query; and the step of 

comparing the one or more sequences of query sub-
word units with the identified one or more portions of
data in said database. 

29. A method according to claim 28, wherein said 
comparing step uses a dynamic programming comparison 
technique to compare the input query with said portions 
of data. 



30. A method according to claim 28 or 29, further
comprising the step of retrieving one or more data files
in dependence upon the results of said comparing step.

31. A storage medium storing processor implementable
instructions for controlling a processor to implement the
method of any one of claims 17 to 30 or storing the data
of claim 16.

32. Processor implementable instructions for controlling
a processor to implement the method of any one of claims
17 to 30.

An apparatus for identifying one or more portions of
data in a database for comparison with a query input by
a user, the query and the portions of data each
comprising a sequence of sub-word units, the apparatus
being characterised by an index having a plurality of
entries, each of which includes a key comprising a
sequence of sub-word unit classifications, which key is
derived from a corresponding sequence of sub-word units
appearing in the database by classifying each of the sub-
word units in the sequence into one of a plurality of
sub-word unit classes, each class comprising sub-word
units that are confusable with other sub-word units in
the same class.



ABSTRACT 



INDEXING METHOD AND APPARATUS

An indexing apparatus and method are described for use in